1. Outliers by definition

Outliers are values within the dataset that vary greatly from the others. They’re either much larger, or significantly smaller. Outliers may indicate variabilities in a measurement, experimental errors, or a novelty.

Cleaning up the data requires some kind of treatment of extreme values to make sure we don’t misinterpret the results due to non representative instances. If we’re analysing housing prices for a city, a generic example could be mixing up the average apartment units with luxury houses. This, might lead to wrong results…


2. Understanting and preparation the database

For this analysis, we’ll be analysis the housing price in Buenos Aires, using the following libraries:

library(ggplot2)
library(knitr)
library(leaflet)
library(sf)
library(tidyverse)
library(ggridges)

#For any missing library, remember to <install.packages('PACKAGE-NAME')>

options(scipen=999) #turn off scientific notation

2.1 Loading the data

Let’s load the all the properties on sale published on Argenprop during 2021 and explore what’s in it. Wel’ll also load the geographical boundaries of the city, available in Buenos Aires Data.

#We'll be using these 2 shape files:

neighbourhoods <- st_read('data/neighbourhood/barrios_badata.shp') %>% 
        st_transform(4326) %>% 
    rename(NEIGHBOURHOOD=BARRIO) %>% 
    select(NEIGHBOURHOOD)
## Reading layer `barrios_badata' from data source 
##   `D:\Escritorio\Proyectos\Outliers-detection\data\neighbourhood\barrios_badata.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 48 features and 5 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 93743.42 ymin: 91566.42 xmax: 111751.4 ymax: 111401.7
## Projected CRS: Argentina_GKBsAs
properties <- st_read('data/real-estate/Deptos_Vta_Anual_2021.shp') %>% 
    st_transform(4326) %>% 
    st_join(neighbourhoods) # adding geo info
## Reading layer `Deptos_Vta_Anual_2021' from data source 
##   `D:\Escritorio\Proyectos\Outliers-detection\data\real-estate\Deptos_Vta_Anual_2021.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 11117 features and 30 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 93950.97 ymin: 93128.01 xmax: 109911.9 ymax: 110363.6
## Projected CRS: Argentina_GKBsAs
names(properties)
##  [1] "OBJECTID"      "DIRECCION"     "NUMERO"        "M2"           
##  [5] "DOLARES"       "U_S_M2"        "PESOS"         "PESOS_M2"     
##  [9] "PISO"          "AMBIENTES"     "DORMITORIO"    "ANTIG_EDAD"   
## [13] "A_ESTRENAR"    "FRENTE_Y_C"    "EDIFICACIO"    "BAUL"         
## [17] "COCH"          "BA_OS"         "LAVADERO"      "TERRAZA___"   
## [21] "AMENITIES"     "COTIZACION"    "OBS1"          "BARRIOS"      
## [25] "COMUNA"        "Zona_1"        "LINKS"         "TRIMESTRE"    
## [29] "CALLE"         "AMENTIES"      "NEIGHBOURHOOD" "geometry"


We’ll keep the property prices, surface and neighbourhood columns:

properties <-  properties %>%  
    select(OBJECTID, DOLARES, M2, U_S_M2, NEIGHBOURHOOD) %>% # selection of columns 
    rename(USD=DOLARES, 
           USDm2=U_S_M2)

kable(head(properties))
OBJECTID USD M2 USDm2 NEIGHBOURHOOD geometry
1 158000 102 1549.0 ALMAGRO POINT (-58.41696 -34.61665)
2 79000 37 2135.1 ALMAGRO POINT (-58.41579 -34.61083)
3 78000 41 1902.4 ALMAGRO POINT (-58.41711 -34.60585)
4 119999 56 2142.8 ALMAGRO POINT (-58.41579 -34.61083)
5 119000 45 2644.4 ALMAGRO POINT (-58.41865 -34.61427)
6 185000 58 3189.7 ALMAGRO POINT (-58.42882 -34.60408)

2.2 Spatial distribution of properties


What’s the geographical distribution of the price?

qpal <- colorQuantile('YlOrRd', properties$USDm2, n = 5)

leaflet(properties) %>%
  setView(-58.44104, -34.62264, zoom = 11) %>% 
  addTiles() %>%
  addProviderTiles(providers$CartoDB) %>%
  addCircles(lng = ~long, lat = ~lat, weight = 3,
             fillOpacity = .2, color = ~qpal(USDm2),
             popup = ~ paste('<strong>','Neighbourhood: ','</strong>', NEIGHBOURHOOD, '<br/>',
                             '<strong>', 'USD/m2: ', '</strong>', USDm2)) %>% 

   addLegend("bottomright", pal = qpal, values = ~USDm2, 
            title = "Price quintiles", opacity = 0.75) 


Higher values tend to cluster along the north precinct and plumb when moving towards the South of the City.

2.3 Non representative values

ggplot(properties)+
    geom_boxplot(aes(x=USDm2, y=NEIGHBOURHOOD, fill=NEIGHBOURHOOD), show.legend = FALSE) +
    scale_fill_viridis_d(option = 'magma')+
    labs(x='USD/m2',
         y='NEIGHBOURHOOD',
         title='Prices distribution by neighbourhood',
         subtitle='City of Buenos Aires')+
    theme_minimal()

Let’s find out the median price for each neighborhood to arrange the database:

mean_1 <- properties %>% 
    as.data.frame() %>% 
    select(-geometry) %>% 
    group_by(NEIGHBOURHOOD) %>% 
    summarise(MEAN_BEFORE=mean(USDm2))

properties <- left_join(properties, mean_1, by='NEIGHBOURHOOD')
p2 <- ggplot() +
    geom_density_ridges_gradient(data=properties, aes(x = USDm2, y = reorder(NEIGHBOURHOOD, -MEAN_BEFORE), 
                                                                                    fill = ..x..),
                                 color= 'grey20', scale = 3, rel_min_height = 0.01, show.legend = FALSE) +
            geom_point(data=properties, aes(x = USDm2, y = NEIGHBOURHOOD), color='grey10', size=.5, alpha=.5) +
  labs(title = 'Density plot per neighbourhood including extreme values') +
    scale_fill_viridis_c(option = 'magma', direction = -1)+
    labs (x='USD/m2',
          y='NEIGHBOURHOOD',
          title='Prices distribution by neighbourhood',
          subtitle='City of Buenos Aires')+
    theme_minimal()

p2

Now, it’s time to extract any extreme values, as they don’t represent and tend to bias and distort the results.

We’ll set the lower and higher limits as following:
\[ Lower\;outlier = Q1 - 1.5*IQR\\ Higher outlier = Q1 - 1.5*IQR \] \[ \]

Let’s get rid of them ASAP:

Q1=quantile(properties$USDm2, c(.25))
Q3=quantile(properties$USDm2, c(.75))
IQR= Q3-Q1

lim_low=Q1-1.5*IQR
lim_high=Q3+1.5*IQR
properties_filter <- filter(properties, USDm2 >lim_low & USDm2 < lim_high)

Done!

6% of the instances have been left aside. Alternatively, if we didnt’t want to lose any row, we could impute the values.

Now, let’s add the mean price for each neighbourhood to the main dataframe after bringing the outliers out of the sample.

mean_2 <- properties_filter %>% 
    as.data.frame() %>% 
    select(-geometry) %>% 
    group_by(NEIGHBOURHOOD) %>% 
    summarise(MEAN_AFTER=mean(USDm2))

properties_filter <- left_join(properties_filter, mean_2, by='NEIGHBOURHOOD')

# We'll be using differece this later:
MEAN_N <- mean_1 %>% 
    left_join(mean_2, by='NEIGHBOURHOOD') %>% 
    mutate(VAR=-(100-(MEAN_AFTER/MEAN_BEFORE)*100))

It now looks like this:

p3 <- ggplot() +
    geom_density_ridges_gradient(data=properties_filter, aes(x = USDm2, y = reorder(NEIGHBOURHOOD, -MEAN_AFTER), 
                                                                                    fill = ..x..),
                                 color= 'grey20', scale = 3, rel_min_height = 0.01, show.legend = FALSE) +
        geom_point(data=properties_filter, aes(x = USDm2, y = NEIGHBOURHOOD), color='grey10', size=.5, alpha=.5) +
  labs(title = 'Density plot per neighbourhood excluding extreme values') +
    scale_fill_viridis_c(option = 'magma', direction = -1)+
    labs (x='USD/m2',
          y='NEIGHBOURHOOD')+
    theme_minimal()

p3

We-ve managed to get a much more fitted data!

ggplot()+
    geom_point(data=properties, aes(x=M2, y=USD), color='darkred', size=1, alpha=.2) +
    geom_point(data= properties_filter, aes(x=M2, y=USD), color='burlywood', size=1, alpha=.5) +
        geom_smooth(data=properties, aes(x=M2, y=USD, color='All values'), method = "loess", alpha=.5, size=1.5, se = FALSE)+
    geom_smooth(data=properties_filter, aes(x=M2, y=USD, color='Representative values only'), method = "loess", size=1.5, alpha=.5, se = FALSE)+
    labs(x='TOTAL SURFACE (m2)',
         y='PRICE (USD)',
         title='Price/total surface',
         subtitle='City of Buenos Aires',
         caption = 'X axis uses log-10 scale', 
         colour='Condition')+
    scale_x_log10()+
    scale_color_manual(values=c('All values'='darkred', 
                               'Representative values only'='burlywood4'))+
    theme_minimal()

3. Spatial distribution


Is there any spatial pattern for the extreme values we can detect? This last sections aims to find out where the outliers are located within de city.

3.1 By neighbourhood

ggplot()+
    geom_point(data= properties, aes(x=USDm2, y=NEIGHBOURHOOD, color='All values'), show.legend = TRUE, size=1.5) +
    geom_point(data= properties_filter, aes(x=USDm2, y=NEIGHBOURHOOD, color='Representative values only'), show.legend = TRUE, size=1.5) +
      geom_vline(xintercept=lim_high, color="grey20", linetype="dashed", size=.5)+
    geom_vline(xintercept=lim_low, color="grey20", linetype="dashed", size=.5)+
    labs(x='TOTAL SURFACE (m2)',
         y='PRICE (USD)',
         title='Price/total surface',
         subtitle='City of Buenos Aires',
         caption = 'X axis uses log-10 scale',
         color='Condition')+
    scale_color_manual(values=c('All values'='darkred',
                               'Representative values only'='burlywood'))+
    theme_minimal()


Retiro, Recoleta, Puerto Madero and Belgrano tend to cluster the maximum extreme values.

Showing the values on a map makes much more sense. One thing left to do first, is to eliminate the non representative values:

neighbourhoods <- neighbourhoods %>% left_join(MEAN_N, by='NEIGHBOURHOOD')

properties_non_representative <- properties[!(properties$OBJECTID %in% properties_filter$OBJECTID),]

palette2 <- c(high="grey80", low= "darkred")


leaflet(properties_non_representative) %>%
  setView(-58.44104, -34.62264, zoom = 11) %>% 
  addTiles() %>%
  addProviderTiles(providers$CartoDB) %>%
    addPolygons(data=neighbourhoods, color = ~colorNumeric(palette2, neighbourhoods$VAR)(VAR),
                popup = ~ paste('<strong>','MEAN VARIATION AFTER OUTLIER TREATMENT','</strong>','<br/>',
                             '<strong>','Neighbourhood: ','</strong>', NEIGHBOURHOOD, '<br/>',
                             '<strong>', 'Variation: ', '</strong>', round(VAR,2), ' %')) %>% 
  addCircles(data=properties_filter, lng = ~long, lat = ~lat, weight = 3,
             fillOpacity = .2, color = 'burlywood',
             popup = ~ paste('<strong>','REPRESENTATIVE VALUE','</strong>','<br/>',
                             '<strong>','Neighbourhood: ','</strong>', NEIGHBOURHOOD, '<br/>',
                             '<strong>', 'USD/m2: ', '</strong>', USDm2)) %>% 
      addCircles(data=properties_non_representative, lng = ~long, lat = ~lat, weight = 3,
             fillOpacity = .2, color = 'darkred',
             popup = ~ paste('<strong>','NON REPRESENTATIVE VALUE','</strong>','<br/>',
                             '<strong>','Neighbourhood: ','</strong>', NEIGHBOURHOOD, '<br/>',
                             '<strong>', 'USD/m2: ', '</strong>', USDm2))


Thanks for reading!